Recent years have witnessed rapid progress in image captioning. However, the demands for large memory storage and heavy computation prevent these captioning models from being deployed on mobile devices. The main obstacles lie in the heavyweight visual feature extractors (i.e., object detectors) and complicated cross-modal fusion networks. To this end, we propose LightCap, a lightweight image captioner for resource-limited devices. The core design builds on the recent CLIP model for efficient image captioning. Specifically, on the one hand, we leverage the CLIP model to extract compact grid features without relying on time-consuming object detectors. On the other hand, we transfer the image-text retrieval design of CLIP to image captioning scenarios by devising a novel visual concept extractor and a cross-modal modulator. We further optimize the cross-modal fusion model and parallel prediction heads via sequential and ensemble distillations. With the carefully designed architecture, our model contains merely 40M parameters, reducing the model size by more than 75% and the FLOPs by more than 98% compared with current state-of-the-art methods. Despite its low capacity, our model still exhibits state-of-the-art performance on prevalent datasets, e.g., 136.6 CIDEr on the COCO Karpathy test split. Tested on a smartphone with only a single CPU, the proposed LightCap achieves a fast inference speed of 188 ms per image, making it ready for practical applications.
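As a hedged illustration of the detector-free feature path described above, the sketch below extracts grid features from a CLIP image encoder, assuming a ViT-based checkpoint from HuggingFace transformers; the paper's actual backbone, checkpoint name, and feature head may differ.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Minimal sketch: grid features are the patch-token hidden states of the
# CLIP image encoder, so no object detector is needed. The checkpoint
# name and the ViT backbone are illustrative assumptions.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

image = Image.open("example.jpg")  # any local image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, 1 + n_patches, dim)
grid_features = hidden[:, 1:]  # drop the [CLS] token: one vector per image patch
```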
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%), and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based; of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
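Since patch-based training was the most commonly reported workaround for oversized samples, here is a minimal, hedged sketch of the idea: randomly cropping fixed-size 3D patches so that large volumes fit into memory. Function and parameter names are illustrative, not taken from any surveyed solution.

```python
import numpy as np

def sample_patches(volume, patch_size=(64, 64, 64), n_patches=8, rng=None):
    """Randomly crop fixed-size 3D patches from a large volume for training."""
    if rng is None:
        rng = np.random.default_rng()
    patches = []
    for _ in range(n_patches):
        start = [rng.integers(0, s - p + 1) for s, p in zip(volume.shape, patch_size)]
        sl = tuple(slice(st, st + p) for st, p in zip(start, patch_size))
        patches.append(volume[sl])
    return np.stack(patches)  # (n_patches, *patch_size)

# Example: a 256^3 volume yields eight 64^3 training patches.
batch = sample_patches(np.zeros((256, 256, 256), dtype=np.float32))
```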
Video super-resolution is one of the most popular tasks on mobile devices, widely used for the automatic enhancement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and poor power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and ask the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models were evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating up to a 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
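For context, a hedged sketch of the kind of NPU-friendly 4X upscaler such challenges tend to favor: a few plain convolutions followed by depth-to-space (pixel shuffle). This is an illustrative baseline under assumed layer sizes, not a description of any specific submission.

```python
import torch
import torch.nn as nn

class TinySR4x(nn.Module):
    """Illustrative mobile-friendly 4x video upscaler: plain convolutions
    plus a single PixelShuffle (depth-to-space), both NPU-accelerable."""
    def __init__(self, channels=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * 4 * 4, 3, padding=1),  # 3 output channels * 4^2
        )
        self.shuffle = nn.PixelShuffle(4)

    def forward(self, x):                  # x: (N, 3, H, W) low-resolution frame
        return self.shuffle(self.body(x))  # (N, 3, 4H, 4W)

sr = TinySR4x()
out = sr(torch.zeros(1, 3, 180, 320))  # 180p -> 720p
```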
Non-parallel multi-domain voice conversion methods such as StarGAN-VC have been widely applied in many scenarios. However, the training of these models often poses a challenge due to their complex adversarial network architectures. To address this, in this work we leverage state-of-the-art contrastive learning techniques and incorporate an efficient Siamese network structure into the StarGAN discriminator. Our method, called SimSiam-StarGAN-VC, boosts training stability and effectively prevents the discriminator overfitting problem during training. We conduct experiments on the Voice Conversion Challenge (VCC 2018) dataset and carry out a user study to validate the performance of our framework. Our experimental results show that SimSiam-StarGAN-VC significantly outperforms existing StarGAN-VC methods in terms of both objective and subjective metrics.
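To make the Siamese design concrete, below is a minimal sketch of the SimSiam-style objective (stop-gradient on the target branch, negative cosine similarity on the predictor output) that the method incorporates into the discriminator; the encoder and predictor modules are placeholders, not the paper's exact architecture.

```python
import torch.nn.functional as F

def simsiam_loss(p1, z1, p2, z2):
    """SimSiam objective for two augmented views.
    p_i = predictor(encoder(view_i)), z_i = encoder(view_i)."""
    def D(p, z):
        # stop-gradient on the target branch prevents representation collapse
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```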
One-shot voice conversion (VC), where only a single utterance from the target speaker is available for reference, has become a hot research topic. Existing works typically disentangle timbre, while information about pitch, rhythm, and content remains mixed together. To further disentangle these speech components and perform one-shot VC effectively, we employ random resampling for the pitch and content encoders, and use the variational contrastive log-ratio upper bound of mutual information together with gradient-reversal-layer-based adversarial mutual information learning to ensure that the different parts of the latent space contain only the desired disentangled representations during training. Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility. In addition, through speech representation disentanglement, we can transfer the timbre, pitch, and rhythm characteristics in one-shot VC separately. Our code, pre-trained models, and demo are available at https://im1eon.github.io/is2022-Srdvc/.
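The gradient reversal layer used for the adversarial mutual information term is a standard construct; a minimal PyTorch sketch follows (the surrounding encoders and mutual information estimator are omitted).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda
    in the backward pass, turning a minimization into adversarial training."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: feed grad_reverse(content_embedding) to an auxiliary classifier;
# the encoder then learns to remove the information the classifier needs.
```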
Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Recently, AutoVC, a conditional autoencoder based method, achieved excellent conversion results by disentangling speaker identity and speech content with an information-constrained bottleneck. However, due to the pure autoencoder training method, it is difficult to evaluate how well content and speaker identity are disentangled. In this paper, a novel voice conversion framework, named $\boldsymbol{T}$ext $\boldsymbol{G}$uided $\boldsymbol{A}$utoVC (TGAVC), is proposed to more effectively separate content and timbre from speech, in which an expected content embedding produced from the text transcription is designed to guide the extraction of the speech content. In addition, adversarial training is applied to remove the speaker identity information from the estimated content extracted from the speech. Guided by the expected content embedding and the adversarial training, the content encoder is trained to extract speaker-disentangled content embeddings from speech. Experiments on the AISHELL-3 dataset show that the proposed model outperforms AutoVC in terms of the naturalness and similarity of converted speech.
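A heavily hedged sketch of the two training signals described above: an L1 guidance term pulling the speech-derived content embedding toward the text-derived one, and an adversarial term discouraging speaker identity in the content embedding. All module names are hypothetical stand-ins; the paper's exact losses and weighting may differ.

```python
import torch.nn.functional as F

def tgavc_content_losses(content_enc, text_enc, spk_clf, mel, text, spk_id, alpha=1.0):
    """Hypothetical TGAVC-style content-encoder losses (illustrative only)."""
    c_speech = content_enc(mel)          # content embedding extracted from speech
    c_text = text_enc(text).detach()     # expected content embedding from transcription
    guide = F.l1_loss(c_speech, c_text)  # pull speech content toward the text guidance
    # Adversarial term: the encoder is rewarded when a (separately trained)
    # speaker classifier cannot recover the speaker from the content embedding.
    adv = -F.cross_entropy(spk_clf(c_speech), spk_id)
    return guide + alpha * adv
```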
We investigate the data-quality-aware dynamic client selection problem for multiple federated learning (FL) services in a wireless network, where each client has dynamic datasets for the simultaneous training of multiple FL services and each FL service has to pay the clients under a constrained monetary budget. The problem is formalized as a non-cooperative Markov game over the training rounds. A multi-agent hybrid reinforcement learning based algorithm is proposed to jointly optimize the client selection and payment actions while avoiding action conflicts. Simulation results show that our proposed algorithm can significantly improve training performance.
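To illustrate just the budget constraint (not the proposed multi-agent reinforcement learning itself), here is a hedged greedy sketch of one round of client selection under a payment budget; the scores and prices are hypothetical inputs.

```python
def select_clients(scores, prices, budget):
    """Greedy illustration: pick clients by score-per-cost until the
    monetary budget is exhausted. The paper instead learns selection
    and payments jointly with multi-agent reinforcement learning."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] / prices[i], reverse=True)
    chosen, spent = [], 0.0
    for i in order:
        if spent + prices[i] <= budget:
            chosen.append(i)
            spent += prices[i]
    return chosen, spent

print(select_clients([0.9, 0.5, 0.8], [3.0, 1.0, 2.0], budget=4.0))
```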
In 6G wireless communication networks, on-demand service provisioning is a crucial issue, since the demands of emerging services vary significantly and network resources become increasingly heterogeneous and dynamic. In this paper, we study the on-demand wireless resource orchestration problem, focusing on the computational delay of the orchestration decision-making process. Specifically, we incorporate the decision-making delay into the optimization problem. A dynamic neural network (DyNN) based method is then proposed, in which the model complexity can be adjusted according to the service requirements. We further build a knowledge base that represents the relationship among the service requirements, the available computing resources, and the resource allocation performance. By exploiting this knowledge, the width of the DyNN can be selected in a timely manner, further improving orchestration performance. Simulation results show that the proposed scheme significantly outperforms the traditional static neural network and also exhibits adequate flexibility in on-demand service provisioning.
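A minimal sketch of the width-adjustable idea, assuming a slimmable fully connected layer: at inference time only the first `width` hidden units are evaluated, trading accuracy for decision latency. This is illustrative; the paper's DyNN architecture is not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicWidthMLP(nn.Module):
    """Width-adjustable network sketch: slicing the weight matrices lets
    one model run at several complexity levels without retraining."""
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x, width):
        h = F.relu(F.linear(x, self.fc1.weight[:width], self.fc1.bias[:width]))
        return F.linear(h, self.fc2.weight[:, :width], self.fc2.bias)

net = DynamicWidthMLP(32, 128, 8)
fast = net(torch.zeros(1, 32), width=32)   # low latency, lower accuracy
full = net(torch.zeros(1, 32), width=128)  # full capacity
```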
Although deep neural networks (DNNs) have achieved great success in audio classification tasks, their uncertainty calibration remains largely unexplored. A well-calibrated model should be accurate when it is certain about its prediction and indicate when it is likely to be inaccurate. In this work, we study the uncertainty calibration of deep audio classifiers. In particular, we empirically investigate the performance of popular calibration methods: (i) Monte Carlo dropout, (ii) ensembles, (iii) focal loss, and (iv) spectral-normalized Gaussian process (SNGP), on audio classification datasets. To this end, we evaluate (i)-(iv) on the tasks of environmental sound and music genre classification. The results show that uncalibrated deep audio classifiers may be overconfident, and that SNGP performs best on the two datasets in this paper while being highly efficient.
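Of the compared methods, Monte Carlo dropout is the simplest to reproduce; a hedged sketch follows, assuming a classifier whose stochasticity comes only from `torch.nn.Dropout` layers.

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: keep only the dropout layers stochastic at test
    time and average softmax outputs over several forward passes."""
    model.eval()
    for m in model.modules():  # re-enable dropout; leave BatchNorm etc. in eval mode
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)  # predictive distribution
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)  # uncertainty score
    return mean, entropy
```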
Human limb motion tracking and recognition plays an important role in medical rehabilitation training, lower-limb assistance, prosthesis design for amputees, feedback control of assistive robots, etc. Lightweight wearable sensors, including inertial sensors, surface electromyography sensors, and flexible strain/pressure sensors, are promising candidates for the next generation of human motion capture devices. In this paper, we present a wireless wearable device consisting of a 16-channel flexible sponge-based pressure sensor array that recognizes various human lower-limb motions by detecting the contours on the human skin caused by the actions of the gastrocnemius muscle in the calf. Each sensing element is a round porous structure of thin carbon nanotube/polydimethylsiloxane nanocomposite, with a diameter of 4 mm and a thickness of about 400 μm. Ten human subjects were recruited to perform ten different lower-limb motions while wearing the developed device. Motion classification results with the support vector machine method show a macro recall of about 97.3% for all ten tested motions. This work demonstrates a portable wearable muscle activity detection device for lower-limb motion recognition applications, which can be used in assistive robot control, healthcare, sports monitoring, etc.
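A hedged sketch of the reported classification step: a support vector machine scored by macro recall, with synthetic stand-ins for the 16-channel pressure features (the paper's actual feature extraction and kernel settings are not reproduced here).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: one 16-channel feature vector per time window, 10 motions.
rng = np.random.default_rng(0)
X = rng.random((1000, 16))
y = rng.integers(0, 10, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("macro recall:", recall_score(y_te, clf.predict(X_te), average="macro"))
```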